Apache Spark

Apache Spark is an open-source cluster computing platform designed to be fast and general-purpose for large-scale data processing.

On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.

On the generality side, Spark is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries, and streaming. By supporting these workloads in the same engine, Spark makes it easy and inexpensive to combine different processing types, which is often necessary in production data analysis pipelines. In addition, it reduces the management burden of maintaining separate tools.

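Much of Spark's speed comes from keeping working data in memory. Iterative algorithms benefit the most, because intermediate results do not have to be written to disk between jobs. The Spark project has reported speedups of up to 10 times over MapReduce for on-disk workloads and up to 100 times for in-memory workloads; the actual gain depends on the workload and the cluster.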
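
Why skipping the disk between jobs matters for iterative algorithms can be illustrated with a toy model in plain Python. This is not Spark code and uses no Spark API. The function names (iterate_via_disk, iterate_in_memory, compare) and the update rule v / 2 + 1 are hypothetical and chosen only for illustration. The first function mimics a chain of jobs that each write their output to a file for the next job to read, and the second keeps the working set in memory across iterations.

```python
import json
import os
import tempfile


def iterate_via_disk(values, steps, path):
    """Toy model of chained jobs: every step reads its input from disk
    and writes its output back to disk for the next step."""
    with open(path, "w") as f:
        json.dump(list(values), f)
    disk_round_trips = 0
    for _ in range(steps):
        with open(path) as f:
            current = json.load(f)          # read the previous job's output
        updated = [v / 2 + 1 for v in current]
        with open(path, "w") as f:
            json.dump(updated, f)           # persist for the next job
        disk_round_trips += 1
    with open(path) as f:
        return json.load(f), disk_round_trips


def iterate_in_memory(values, steps):
    """Toy model of in-memory iteration: the working set never leaves memory."""
    current = list(values)
    for _ in range(steps):
        current = [v / 2 + 1 for v in current]
    return current, 0                       # no disk round trips between steps


def compare(values, steps):
    """Run both variants on the same input and return their results."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "intermediate.json")
        via_disk = iterate_via_disk(values, steps, path)
    in_memory = iterate_in_memory(values, steps)
    return via_disk, in_memory
```

Both variants compute the same result. The difference is that the disk-based variant pays one read and one write per iteration, a cost that grows with both the number of iterations and the size of the data.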

Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries.
